A study of accuracy and precision in oligonucleotide arrays: 
extracting more signal at large concentrations 

Felix Naef^, Nicholas D. Socci^ and Marcelo Magnasco^ 
^Center for Studies in Physics and Biology, ^Laboratory for Molecular Genetics, 
Rockefeller University, 1230 York Avenue, NY 10021 

February 2, 2008 



Abstract 



Introduction 



Motivation: Despite the success and popularity 
of ohgonucleotide arrays as a high-throughput tech- 



High-density ohgonucleotide arrays manufactured by 
Affymetrix are among the most sensitive and reliable 



uique foi' ineayui'ing inRNA expi'esaioii levela, quan- 
LlLaLive calibi'aLioii sLudicy have uiiLil now been lim- 
ited. The main reason is that suitable data was not 
available. However, calibration data recently pro- 
duced by Affymetrix now permits detailed studies of 
the intensity dependent sensitivity. Given a certain 
transcript concentration, it is of particular interest 
to know whether current analysis methods are capa- 
ble of detecting differential expression ratios of 2 or 
higher . 

Results: Using the calibration data, we demonstrate 
that while current techniques are capable of detect- 
ing changes in the low to mid concentration range, 
the situation is noticeably worse for high concentra- 
tions. In this regime, expression changes as large as 
4 fold are severely biased, and changes of 2 are of- 
ten undetectable. Such effects are mainly the con- 
sequence of the sequence specific binding properties 
of probes, and not the result of optical saturation in 
the fluorescence measurements. GeneChips are man- 
ufactured such that each transcript is probed by a 
set of sequences with a wide affinity range. We show 
that this property can be used to design a method 
capable of reducing the high intensity bias. The idea 
behind our methods is to transfers the weight of a 
measurement to a subset of probes with optimal lin- 
ear response at a given concentration, which can be 
achieved using local embedding techniques. 

Availability: Program source code will be sent elec- 
tronically upon request. 

Contact: felix@funes.rockefeller.edu; 

soccin@rockefeller.edu; 

marcelo@zahir.rockefeller.edu 



microarray technology (Chee et at, 1996; Lipshutz 



et ai, 1999) available. Based on a photolithographic 



oligonucleotide deposition process, labeled and am- 
plified mRNA transcripts are probed by 22-40 (de- 
pending on chip models) short DNA sequence each 
25 bases long. The probes are preferentially picked 
near the 3' end of the mRNA sequence, because of 
the limited efficiencies of reverse transcription en- 
zymes. In addition, the probes come in two varieties: 
half are perfect matches (PM) identical to templates 
found in databases, and the other half are single mis- 
matches (MM), carrying a single base substitution in 
the middle (13th) base of the sequence. MM probes 
were introduced to serve as controls for non-specific 
hybridization, and most analysis methods postulate 
that the actual signal (the target's mRNA concen- 
tration) to be proportional to the difference of match 
versus mismatch (PM-MM). 

The purpose of this work is twofold. First, we 
present a detailed calibration study of GeneChips. 
Specifically, we apply latest analysis methods (MAS 
5.0 algorithm and others) to a large yeast calibra- 
tion dataset, in which a number of transcripts are 
hybridized at known concentrations. We investigate 
the concentration dependence of both the accuracy 
and precision of differential expression scores. By ac- 
curate, we mean that the reported numerical ratios 
values are close to the known expression ratios, and 
have therefore little bias. On the other hand, a mea- 
surement is precise if it has a low noise level, also re- 
ferred to as small variance. Our results show that the 
ability of conventional analysis techniques to detect 
small changes strongly deteriorates toward high tran- 
script concentrations. While the variance is smallest 
for high concentrations, it appears that the question 
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of the bias in this regime has been neglected. In fact, 
the bias is strong enough that real changes of 2 (even 
4) can often not be detected. This sounds at first 
counter-intuitive, which we believe is rooted in the 
following widespread interpretation of hybridization 
data. Namely, when examining the data from two 
replicated conditions (Figure ^) most would focus 
on the low intensity region, and observe how noisy 
this regime appears to be in comparison to the high- 
intensity tail. However, this view is misleading, as it 



duos noL consider Lhc qucsLioii of Lhc bias. Turuiug Lu 
•d, cuniparisun uf Lwu dllfurcuL coiidiLluus (Figure |l|b), 
we notice that the noise envelope is essentially un- 
changed, and that real changes appear as points ly- 
ing distinctively outside the noise cloud. Looking 
at multiple such comparisons, we would then con- 
clude that the high intensity data is almost always 
very tightly scattered about the diagonal, and that 
there are rarely genes in that region that show fold 
changes greater than, say, 1.5 or 2. The interpre- 
tation that no differential regulation occurs in highly 
expressed transcripts seems unlikely. In fact, we show 
evidence that real changes are often compressed for 
large concentrations. This saturation effect can actu- 
ally be observed in Affymetrix's own data|^ although 
the issue is not commented there (a qualitative re- 
port has been given in (Chudin et ai, 2001)). The 
physical origin for the compression effect invokes non- 
linear probe affinities and chemical saturation. This 
is a separate issue from optical saturation (cf. Re- 
sults). Chemical saturation occurs below the detec- 
tor threshold and is attributed to the fact that some 
probes will exhaust their binding capacities at rela- 
tively low concentrations, simply because their bind- 
ing affinities are high. Binding affinities are in fact 
very sensitive to the sequence composition, resulting 
in measured brightnesses that usually vary by several 
decades within a given probeset ( Naef et ai, 2002 ). 

Our second goal is to present an analysis method 
that reduces the bias at high concentrations. Our 
approach uses all PM and MM probes equally, in 
contrast to the standard view in which the PMs are 
thought to carry the signal, while the MMs serve as 
non-specific controls. In fact, it has become clear that 
the MM probes also track the signal, usually with 
lower ( although often wit h higher) affinities than the 
PMs ( |Naef et ai, 2002|) . In that sense, the MMs 
should be viewed as a set of on average lower affinity 
probes. It is then reasonable to expect that some MM 
probes will more accurate at high intensities, since 
they will be less affected by saturation than the the 



PMs (cf. Figure 



Methods 

The existing methods for the analysis of the raw 
data fall into two main classes. The first methods 
are similar to Affymetrix's Microarray Suite software, 
providing absolute intensities on a chip by chip ba- 
sis, or differential expression ratios from two exper- 



iments (Affymetrix, 2001; Naef et 



2001 



et al, 2002|). The second class are called "model- 



based" approaches ( |Li fc Wong, 2001| ), and attempt 
to fit the probe affinities from a large number of ex- 
periments. 

The method described below belongs to the second 
class and is specifically designed for improved accu- 
racy in the compressive high-intensity regime. It is 
based on ideas borrowed from the theory of locally 



linear embeddings (Roweis & Saul, 200C) 



Notation 

We construct the following matrix 



PMi 



MM^f ^p) Np< j < 2Np 



or in expanded notation 



A' = 



PM\ 



PMf" 



MM\ 



MMf" 



PMl 



MM^, 



PM\ 



PM 



MM 



MM 



which contains the raw, background subtracted and 
normalized data. By background, we mean fluo- 
rescence background, which we identify by fitting a 
Gaussian distribution to the subset of all (-PM, MM) 
pairs satisfying the criterion that \PM — MM\ < e, 
with e = 50. This provides us with the mean and SD 
in the background fiuorescence. For a fair com- 
parison of compression effects in various methods, we 
used the global normalization factors from the MAS 
5.0 software in all cases, however, the technique re- 
mains applicable with other normalization schemes. 

Np is the number of probe pairs and N,, is the num- 
ber of experiments. We introduce a set of weights Wi 
such that 

JVe 

^ = 1 



^Figure 7 at 



tittp: //www . affymetrix . com/products/algor itlms_tech.html http: / /xxx. lanl.gov/abs/physics/010201C 



^for details, see 
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and define the column means (or center of mass) measure for the entire probeset 



J2 ^^ l0g(^D 



Note, we are computing the mean of the logs of the 
components of Aj . 

Local principal component analysis 

Local embeddings are adequate in situations where 
compression is important because non-linearities (re- 
sulting from chemical saturation) affect the one- 
dimensional manifold {FM-'(c), MM-' (c)} (the con- 
centration c is the one-dimensional 'curve' parame- 
ter) by giving it a non-zero local curvature. The re- 
sults section, in particular Figure || contains ample 
evidence that these non-linearities are significant (cf. 
also Figure 2 in ( IChudin et al, 200 1| )). Our method 
is a multidimensional generalization of the schematic 
depicted in Figure which shows the typical situa- 
tion of two probes in which one of the probes {PM2) 
saturates at concentrations lower than the other. If 
both probes were perfectly linear, the curve would be 
a straight line with slope 1. In the multidimensional 
case, the directions of largest variation (analogous to 
Dl or D2 in Figure |^) are computed from the princi- 
pal components of the matrix 



E 

k=l 



U^k Dk Vi 



which can easily be done via singular value decom- 
position (SVD). In order to reconstruct the concen- 
trations, one needs first to consider the unspecified 
sign of the vector V/ (when returned by the SVD 
routine), which has to be chosen such that the total 
amount of signal comes out positive. This is eas- 
ily achieved by adjusting the sign of Vl such that 
Yli j log(^i) vl > Q . The logarithm of the concen- 
tration Si, is then computed by projecting the orig- 
inal matrix onto the first principal component V( , 
corrected by a factor Vmax- This factor accounts 
for the fact that the vector V( is i2-normalized 
(Ej (^Z)^ = 1) by definition in the SVD. The sig- 
nal then reads: 



W Y.\og{A^)Vl 

3 = 1 



where Vmnr — max. 



VI 



. In addition, the above pro- 



N 



2 



where {Dk} are the singular values. Large S/N val- 
ues imply that the probeset measurements in the Ng 
experiments had a well-defined direction of variation, 
and can for instance be used as a filter for identify- 
ing genes that exhibit significant changes across the 
experiments tested. In the results shown in Figure ^, 
we used the following weights 



Wi 



W\ l/(l + 6(z-z,)2) 



< r 



cedure automatically yields a signal-to-noise (S/N) 



where W — Wi, Si is the signal obtained with 
uniform weights, is = 20 (out of 28 experiments), 
6 = 2, and F = S'i^+i with Si being the ascen- 
dantly sorted {s^}. In other words, lower concen- 
tration points are suppressed according to their rank 
(computed with uniform weights) using a slowly de- 
caying Cauchy weight function. There are of course 
other weight functions that could serve the same pur- 
pose. 

Note that the fitting procedure used in the Li- 
Wong ( Li fc Wong, 2001 ) method is identical to an 
SVD decomposition, however, with different input 
data than was use here. The three main difference 
between our method and the Li- Wong technique are: 
(i) in the analysis here, we used log transformed PM 
and MM intensities, rather than the bare PM-MM 
values; (ii) we introduced optional weights, which can 
account for non-linearities of the probe response in 
the high concentration regime; (iii) we subtract the 
column mean before we compute the principal com- 
ponents, which is crucial for capturing the local direc- 
tions of variation. Indeed, as can be seen in Figure ^, 
the principal component would be dominated by the 
mean itself without subtraction. 

Results 
Data sets 

The yeast Latin square (LS) experiment is a calibra- 
tion data set produced by Affymetrix, that uses a 
non-commercial yeast chip. 14 groups of 8 differ- 
ent genes, all with different probe sets, are spiked 
onto 14 chips at concentrations, in pM, correspond- 
ing to all cyclic permutations of the sequence (0, 0.25, 
0.5, 1, 2, . . ., 1024). Hence, each group is probed 
at 13 different finite concentrations, logarithmically 
spaced over a range of magnitudes from 1 to 4096 
(in Figures ^ and ^, we refer to these concentrations 
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as (1 = 0.25 pM, 2 = 0.5pM ... 13 = 1024 pM), ware and th e "2 chips" method discussed in ( [Naei 
and each group is completely absent in one array, et ai, 2002). The later method computes for each 



Besides the spiked-in target cRNA's, a human RNA 
background was added to mimic cross-hybridization 
effects that would occur in a real experiment. In ad- 
dition, each experiment was hybridized twice leading 
to 2 groups of 14 arrays called Rl and R2. 

The reason why this dataset is attractive as com- 
pared to the similar human and e. coli datat avail- 
able at www.netaffx.com are several. First, the e. 
coli data exhibits severe optical saturation, which in- 
terfers with the chemical saturation issue we are try- 
ing to address here. The yeast dataset, on the other 
hand, has virtually no optically saturated cells, as can 
be inferred the SD in the pixel intensities reported in 
the raw data (.CEL) files. In total, fewer than 0.1% of 
the probes have SD = {SD = characterizes opti- 
cal saturated cells). Further, optical saturation is no 
longer an issue in GeneChips with the current scan- 
ner settings. More important, the present dataset 
permits far better statistics, as the number of spiked- 
in genes is 112 as compared to 14 for the human chip. 
In the latter dataset, there is only one transcript per 
concentration group as compared to 8 in the yeast 
case. In fact, we verified that the compression effects 
discussed below are virtually identical in the human 
case (not shown). 

Summary of two-array methods 

The figures in the results section show the log-ratios 
as function of concentration in the form of boxplots. 
In these plots, the central rectangle indicates the 
median, 1st and 3rd quartiles {Ql and Q3). The 
"whiskers" extend on either side up to the last point 
that is not considered an outlier. Outliers are explic- 
itly drawn and are defined as laying outside the inter- 
val [Ql-1.5*/Q, Q3-l-1.5*/Q], where IQ = QS-Ql 
is the interquartile distance. For each method, we 
show three plots, the top two measure the false nega- 
tive rate for ratios of 2 and 4 fold respectively, and the 
last one shows the false positive rate. For the top two 
plots, all combinations (within Rl and R2 separately) 
of arrays leading to ratios of 2 and 4 were considered, 
and plotted as function of their baseline intensity (the 
lesser of the two concentrations). For the third, each 
gene was compared between the groups Rl and R2, 
at identical concentrations. Of the 8 * 14 = 112 tran- 
scripts, 8 were left out of the analysis because they 
did not lead to a signal that was tracking the con- 
centrations at all (presumably due to bad probes or 
transcript quality). 

In Figure ^ we summarize the results obtained by 
the Microarray Analysis Suite 5.0 (MAS 5.0) soft- 



gene probed in two arrays a ratio score R such that 



robust 



log(i?) = J2 log(^') 



is a robust geometric mean (a least trimmed squares 
estimator) of the probe ratios . Figure |^ shows the 
cases where 



R^ 



PM'i 



MM{ 



and 



R^ 



MM{ 



In both cases, only probes with numerator and de- 
nominator above background are retained. The first 
case {PM — MM) is in essence similar to the MAS 5.0 
program, differences are in the choice of the Tuckey 
bi- weight mean as the robust estimator, and in the 
treatment of negative differences. For our purpose 
here, we like to think of the Affymetrix method as 
two-array, PM — MM based method. In all the results 
presented below, the arrays were scaled according to 
the MAS 5.0 default settings. 

The main features of Figure ^ are: there is an op- 
timal range of baseline concentrations (w 1 — 16 pM) 
in which the ratio values from both PM — MM meth- 
ods (the two first columns) are fairly accurate, for 
both ratios of 2 and 4. For both lower and higher 
concentrations, there is a noticeable compression ef- 
fect, which is most dramatic at the high end. At the 
highest baseline concentration (512 pM for the ra- 
tios of 2 and 256 pM for ratios of 4), changes of 2 
are basically not detected and real changes of 4 are 
compressed on average to values around 1.25. The 
analysis of the false positive rate (last row) shows 
that both methods yield very tight reproducibility: 
the log2 ratio distributions are well centered around 
and the interquartile distances are roughly inten- 
sity independent and smaller than 0.2, meaning that 
50% of the measurements fall in the ratio interval 
[0.93, 1.07]. To be fair, we should point out that as a 
{PM - MM) method, the MAS 5.0 algorithm is on 
average a bit cleaner, having slightly fewer outliers. 
However, we like to emphasize that the qualitative be- 
havior in the two {PM — MM) methods is unchanged, 
especially as far as the high-intensity compression is 
concerned. Further, similar behavior is also found 
using the {PM - MM) Li- Wong method (data not 
shown). The above observations are consistent with 
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what was reported in ( |Chudin et al, 200l| ), confirm- 
ing that these effects are independent of the chip se- 
ries. 

The third column in Figure || illustrates our con- 
tention that the MM are in essence a set of lower 
affinity probes. We notice that using only the MM 
measurements in the two-array method changes the 
picture qualitatively. Whereas the low concentration 
regime is far worse than in the {PM — MM) meth- 
ods, the behavior toward the high end has changed 
and the drop off occurs now at higher concentrations: 
approximately 256 pM for the ratios of 2 and 128 pM 
for ratios of 4. On the other hand, even in the optimal 
range, the magnitude of the medians are always a bit 
lower than the real ratios, and the false positive rate 
also suffers. To summarize, this result suggests that 
if one is interested in accuracy at high concentrations, 
then the MM-only methods offers the best two-array 
alternative. We have tried other variations: PM only, 
or the double size set consisting in the merged PM 
and MMs, both being worse at high concentrations 
than the MM only method. 

Multi-array methods 

The data analyzed using our new method is shown in 
Figure ^. It is clear that both are capable of reducing 
the high intensity compression, as compared to exist- 
ing methods. The second column explicitly shows 
the higher accuracy of the local method. It should 
be noted, however, that the precision is significantly 
lower than with MAS 5.0, which is the trade-off to 
pay for higher signal detection. As compared to the 
"two chip" MM method, which was previously the 
least compressive in this regime, the medians are sys- 
tematically more accurate. Also, the method does 
not perform well at low-concentrations which is ex- 
pected since it was not designed for that range. 

Significance scores 

Although ratio score may suffer severe compression, 
there remains the possibility that they would be at- 
tributed a significant increase or decrease call. Fig- 
ure ^ displays the relation between the MAS 5.0 log- 
ratios and their associated p-values. MAS 5.0 change 
'p- values' pm are symmetric about 0.5 and designed 
such that the ratio score is called increased when 
Pm < 7i and decreased (D) when pi\j > 1 — 71, with a 
default 71 = 0.0025. This definition is not well suited 
for plotting purposes, we therefore work with pmas = 
Pm when pM < 0.5, and pmas = 1 — Pm otherwise. 
This way, both I and D genes have pmas < 71, the 
direction being given by the sign of the log-ratio. The 



results show that there remarkably few false positive 
calls: only 4 out of 624 for concentrations c < 8 pM, 
and 6 of 728 when c > 16 pM. Fold changes of 4 are 
also well detected despite the compression at high 
intensities: there are 21 false negatives (and 3 false 
positives having ratios with the wrong sign) out of 
1248 for c < 8pM, and 84 of 1040 false negatives 
for c > 16 pM. The situation deteriorates for fold 
changes of 2, with 124 false negatives (and 3 false 
positives) out of 1248 for c < 8pM, and 425 of 1248 
false negatives for c > 16 pM. 

Discussion 

We have shown that high-concentration bias is a seri- 
ous issue in GeneChips, which is probably related to 
chemical saturation in the adsorption process of the 
target to the probes. Exploiting the broad range of 
affinities of different probes {PM and MM included) 
offers an approach toward improvements. However, 
the gain in accuracy comes with an expected decrease 
in precision, since effectively, the weight of a measure- 
ment is transferred to a smaller set of probes. Hence, 
the reduction in noise levels resulting from averaging 
over probes is diminished. 

Our method should serve as a useful complement to 
those who use microarrays primarily as a gene discov- 
ery tool, and are interested in maximal signal detec- 
tion. In fact, we often hear that severe constraints like 
pharmaceutical treatments or gene knockouts appear 
to have no detectable transcriptional effects. While 
there is the possibility that transcription regulatory 
networks can compensate for such changes, or that 
some effects would be mostly post-transcriptional, 
real transcriptional changes may also be masked by 
compressive effects like those discussed above. 

Conclusion 

We have summarized the performance of existing 
methods for analyzing Affymetrix GeneChip data, 
using the yeast calibration dataset from Affymetrix. 
The results show unambiguously the compressive 
tendency of GeneChip measurements in the high- 
intensity range, namely that fold changes as large as 
4 in expression levels can be reduced to fold changes 
barely larger than 1 (Figure ||). Interestingly, we 
showed that among the standard techniques, the one 
using only the MM signals offers the highest accuracy 
at high concentrations. Additionally, we have how it 
is possible to achieve higher accuracy at high con- 
centrations by exploiting the probeset's wide affinity 
range. One should realize, however, that saturation 
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problems of the sort encountered present a hard chal- 
lenge in signal processing, and it is therefore expected 
that higher accuracy is obtained at the expense of re- 
duced precision. 

Our observations raise the following design issue for 
oligonucleotide arrays. Since it will likely be difficult 
to manufacture oligonucleotide probes with linear re- 
sponses over 4 or more decades in concentration, an 
option would be to optimize the design of probesets 
such that each of its probe would be optimally lin- 
ear in smaller ranges (say 2 decades at most) cen- 
tered around graded concentrations. In this way, the 
weights of a measurement could be transferred to an 
appropriate subset of probes known to be optimal in 
a given range. Hence, one would use a different set 
of probes for high or low concentration values to in- 
crease the overall dynamic range of a probeset. 
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Figure 1: Typical scatterplots from GeneChip data, a) Log transformed intensities for repeated hybridization 
conditions (duplicates), b) different conditions. The red lines show the lines of local standard deviation (SD=2) in 
the log-rations. 
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Figure 2: Typical compressive situation: a 2 dimensional cartoon. The open dots represent fictitious measurement 
of two probes (PMl, PM2) at increasing concentrations (from left to right). Probe PM2 saturates earlier than PMl. 
Ml = (m^, m^) represents the mean with uniform weights {wi} and M2 a mean obtained with weights that axe larger 
for high concentrations. Dl and D2 show the corresponding principal components (direction of largest variance). It 
is clear that projecting the points onto Dl has the effect of a compression due to the curvature. On the other hand, 
this compression is largely reduced at high-intensities by projecting onto D2. 
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Fig ure 3: Comparis on of 'two array' methods: MAS 5.0 (first column), PM — MM (second) and MMonly (third) 
of (Naef et al., 2002). Boxplots show the log base 2 ratio distributions for each baseline concentration group (cf. 
text). Row A: Fold change of 2, B: Fold change 4, C: Negative controls (false positives). The central rectangle 
indicates the median, 1st and 3rd quartiles (Ql and Q3). The "whiskers" extend on either side up to the last point 
which is not considered to be an outlier. Outliers are explicitly drawn and are defined as laying outside the interval 
[Ql — 1.5 * IQ, Q3 + 1.5 * IQ], where IQ — Q3 — Ql is the interquartile distance. Notice the two first rows are 
qualitatively similar, with the MAS 5.0 being marginally cleaner. Both methods show a strong high concentration 
compression, but have excellent reproducibility (cf. text). The third column illustrates that MM probes contain 
valuable signal, often leading to more accurate ratios at high concentrations. 
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Figure 4: Rows are as in Figure columns show the new method with uniform (first column), Cauchy weights 
introduced in the text (second), and the reference MAS5.0 (third). 
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Figure 5: P-values vs. log-ratios in for MAS 5.0. The plotted pmas is the transformed MAS 5.0 p-value (cf. 
text). The dotted line indicates the default 71 = 0.0025, below which MAS 5.0 scores are considered increased (I) or 
decreased (D) (for the transformed p-value). Colors are used to group baseline intensities of 0.25-1 pM (black), 2-8 
pM (red), 16-64 pM (green), 128-1024 (blue). 
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